City College of San Francisco
MATH 108 - Foundations of Data Science
Lecture 09: Charts¶
Associated Textbook Sections: 7.0, 7.1
Overview¶
Set Up the Notebook¶
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
W. E. B. Du Bois¶
Background¶
The content of the following podcast, video, and images contains references to slavery, lynching, and the historical use of the word negro.
- Scholar, historian, activist, and data scientist
"The Philadelphia Negro was the first scientific study of race in the world. [...] the first non-racist investigation of a non-white poulation in the world. [...] one of the first social scientific written in the U.S. using the advanced statistical methods of the time." - Dr. Tukufu Zuberi, Professor of Race Relations at the University of Pennsylvania (Source: A Legacy of Courage: W.E.B. Du Bois and the Philadelphia Negro)
- First Black American to receive a PhD from Harvard
- NAACP founder
- Made a series of visualizations for the 1900 Paris Exposition
- Goal: Change the way people see Black Americans
- Hundreds of photographs and patents
- 60+ handmade graphs in 3 months
"All art is propaganda, and ever must be, despite the wailing of the purists. I stand in utter shamelessness and say that whatever art I have for writing has been used always for propaganda for gaining the right of black folk to love and enjoy. I do not care a damn for any art that is not used for propaganda." - W.E.B. Du Bois
- Compared with Booker T. Washington
The following podcast provides an 11-minute overview of these two leaders.
from IPython.display import IFrame
IFrame('https://open.spotify.com/embed/episode/6MdipyUuPK2bbXF0n2CYA1?utm_source=generator',
width=500, height=350)
Images from Paris Exposition¶
Why Do We Visualize Data¶
- A large fraction of our brains is dedicated to visual reasoning.
- In Data Science we use visualization:
- For others – to communicate our findings
- For ourselves – to understand our data, see patterns, and discover relationships
Demo: Identifying Data Type of Column Values¶
Load the actors.csv data. The 'Total Gross', 'Average per Movie', and 'Gross' values represent thousands of dollars.
actors = Table().read_table('./data/actors.csv')
actors
| Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross |
|---|---|---|---|---|---|
| Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7 |
| Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4 |
| Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9 |
| Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415 |
| Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4 |
| Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2 |
| Tom Cruise | 3587.2 | 36 | 99.6 | War of the Worlds | 234.3 |
| Johnny Depp | 3368.6 | 45 | 74.9 | Dead Man's Chest | 423.3 |
| Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9 |
| Scarlett Johansson | 3341.2 | 37 | 90.3 | The Avengers | 623.4 |
... (40 rows omitted)
The actor's name is a categorical attribute.
# identifying the type of the Actor variable
type(actors.column('Actor').item(0))
str
The total gross dollar is a numerical attribute.
type(actors.column('Total Gross').item(0))
float
Course Visualizations¶
- In the course we will mostly use the following visualizations:
- Histograms
- Line Graphs
- Scatter Plots
- Bar Charts
- You will need to overlay graphs to explore relationships
- How you visualize your data depends on attribute type
- The data type doesn't determine the numerical/categorical attribute label.
- '$12.00' is a str and likely to reflect a numerical attribute.
- The context of the data and analysis is important to understand.
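To make the distinction concrete, here is a minimal sketch (values are illustrative, not from the course data): a dollar amount stored as a str can be converted to a float once the formatting characters are stripped, because the context tells us it represents a numerical attribute.

```python
# '$12.00' is stored as a str, but in context it represents a dollar amount.
price_text = '$12.00'

# Strip the formatting character and convert the rest to a float
price_number = float(price_text.strip('$'))

print(type(price_text))   # <class 'str'>
print(price_number)       # 12.0
```

The data type (str) and the attribute type (numerical) disagree here; the analysis context is what settles it.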
You will indirectly work with the standard Matplotlib library for data visualization using the datascience library. You can optionally interact with visualizations using the Plotly library, but customizing and creating interactive visualizations is not required and you will not be tested on these things.
Good Practices¶
- Less can be more
- Minimize decoration
- Choose colors carefully: Minimize the number of different colors
- If data are numerical, preserve their relative values and distances between them
See Edward Tufte's "The Visual Display of Quantitative Information" for additional suggestions.
Categorical Data¶
(Horizontal) Bar charts barh are a standard way to visualize the distribution of a single categorical variable.
A Bar Chart¶
The following code uses group. We will address that later in the course. Additionally, there is customization to the visual done on the lines that start with plots. You are not responsible for this customization.
cones = Table().read_table('./data/cones.csv')
cones_grouped_by_flavor = cones.group('Flavor')
cones_grouped_by_flavor.barh('Flavor')
plots.title('Distribution of Ice Cream Flavors')
plots.show()
# can see what the grouped table looks like - the data used for the bar chart
cones_grouped_by_flavor
| Flavor | count |
|---|---|
| bubblegum | 1 |
| chocolate | 3 |
| strawberry | 2 |
Demo: Bar Charts¶
The dataset top_movies_2023.csv shows the 1,000 highest-grossing movies worldwide as listed on IMDb. Adjusted total gross values were also provided for movies released before 2021 using the Consumer Price Index (CPI)-based Python library cpi.
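The idea behind the adjustment is to scale a past gross by the ratio of a recent CPI to the CPI of the release year. A minimal sketch with approximate annual CPI figures (the dataset itself was built with the cpi library, which looks up official values):

```python
# Approximate annual CPI-U values, used here for illustration only
cpi_1977 = 60.6     # release year of Star Wars
cpi_2021 = 271.0    # target year for the adjustment

gross_1977 = 775_398_007  # Star Wars (1977) gross, from the table below

# Past dollars scaled into target-year dollars
gross_adjusted = gross_1977 * (cpi_2021 / cpi_1977)
```

With these rough CPI values the result lands near 3.5 billion dollars, in the same ballpark as the 'Gross (Adjusted)' column.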
top_movies = Table.read_table('./data/top_movies_2023.csv')
top_movies
| Created | Modified | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Gross | Gross (Adjusted) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-01-06 | 2023-01-06 | Gone with the Wind | https://www.imdb.com/title/tt0031381/ | movie | 8.2 | 238 | 1939 | Drama, Romance, War | 318271 | 1939-12-15 | Sam Wood, George Cukor, Victor Fleming | 402382193 | 7.84414e+09 |
| 2023-01-06 | 2023-01-06 | Bambi | https://www.imdb.com/title/tt0034492/ | movie | 7.3 | 69 | 1942 | Animation, Adventure, Drama, Family | 145676 | 1942-08-09 | Samuel Armstrong, Paul Satterfield, Graham Heid, James A ... | 267447150 | 4.44602e+09 |
| 2023-01-06 | 2023-01-06 | Titanic | https://www.imdb.com/title/tt0120338/ | movie | 7.9 | 194 | 1997 | Drama, Romance | 1187108 | 1997-11-01 | James Cameron | 2201647264 | 3.71701e+09 |
| 2023-01-06 | 2023-01-06 | Avatar | https://www.imdb.com/title/tt0499549/ | movie | 7.9 | 162 | 2009 | Action, Adventure, Fantasy, Sci-Fi | 1318546 | 2009-12-10 | James Cameron | 2922917914 | 3.69178e+09 |
| 2023-01-06 | 2023-01-06 | Snow White and the Seven Dwarfs | https://www.imdb.com/title/tt0029583/ | movie | 7.6 | 83 | 1937 | Animation, Adventure, Family, Fantasy, Musical, Romance | 202792 | 1937-12-21 | William Cottrell, Ben Sharpsteen, David Hand, Perce Pear ... | 184925486 | 3.47981e+09 |
| 2023-01-06 | 2023-01-06 | Star Wars | https://www.imdb.com/title/tt0076759/ | movie | 8.6 | 121 | 1977 | Action, Adventure, Fantasy, Sci-Fi | 1372821 | 1977-05-25 | George Lucas | 775398007 | 3.46716e+09 |
| 2023-01-06 | 2023-01-06 | Avengers: Endgame | https://www.imdb.com/title/tt4154796/ | movie | 8.4 | 181 | 2019 | Action, Adventure, Drama, Sci-Fi | 1144892 | 2019-04-22 | Anthony Russo, Joe Russo | 2797501328 | 2.96506e+09 |
| 2023-01-06 | 2023-01-06 | The Exorcist | https://www.imdb.com/title/tt0070047/ | movie | 8.1 | 122 | 1973 | Horror | 413376 | 1973-12-26 | William Friedkin | 441306145 | 2.69326e+09 |
| 2023-01-06 | 2023-01-06 | Jaws | https://www.imdb.com/title/tt0073195/ | movie | 8.1 | 124 | 1975 | Adventure, Thriller | 612946 | 1975-06-20 | Steven Spielberg | 476512065 | 2.40001e+09 |
| 2023-01-06 | 2023-01-06 | Star Wars: Episode VII - The Force Awakens | https://www.imdb.com/title/tt2488496/ | movie | 7.8 | 138 | 2015 | Action, Adventure, Sci-Fi | 936837 | 2015-12-14 | J.J. Abrams | 2069521700 | 2.36598e+09 |
... (990 rows omitted)
Since Gone with the Wind has been re-released several times, the adjusted value is not the most honest representation of its adjusted gross. For a more comparable analysis, reduce the table to the top 10 movies based on adjusted gross values ('Gross (Adjusted)') for movies released in the last decade.
top_movies_select = top_movies.select('Title', 'Year', 'Gross (Adjusted)')
top_movies_last_decade = top_movies_select.where('Year', are.above(2012)) # from last decade
top_movies_last_decade_sorted = top_movies_last_decade.sort('Gross (Adjusted)', True) # sort by gross adjusted
top10 = top_movies_last_decade_sorted.take(np.arange(10)) # take the top ten
top10
| Title | Year | Gross (Adjusted) |
|---|---|---|
| Avengers: Endgame | 2019 | 2.96506e+09 |
| Star Wars: Episode VII - The Force Awakens | 2015 | 2.36598e+09 |
| Avengers: Infinity War | 2018 | 2.21039e+09 |
| Spider-Man: No Way Home | 2021 | 1.91631e+09 |
| Jurassic World | 2015 | 1.91099e+09 |
| The Lion King | 2019 | 1.76269e+09 |
| Fast & Furious 7 | 2015 | 1.73242e+09 |
| Avengers: Age of Ultron | 2015 | 1.60376e+09 |
| Frozen II | 2019 | 1.53688e+09 |
| Avatar: The Way of Water | 2022 | 1.51656e+09 |
Convert the gross (adjusted) values to billions of dollars for readability.
billions = np.round((top10.column('Gross (Adjusted)') / 1000000000), 2)
top10 = top10.with_column('Gross Adjusted, billions', billions)
top10
| Title | Year | Gross (Adjusted) | Gross Adjusted, billions |
|---|---|---|---|
| Avengers: Endgame | 2019 | 2.96506e+09 | 2.97 |
| Star Wars: Episode VII - The Force Awakens | 2015 | 2.36598e+09 | 2.37 |
| Avengers: Infinity War | 2018 | 2.21039e+09 | 2.21 |
| Spider-Man: No Way Home | 2021 | 1.91631e+09 | 1.92 |
| Jurassic World | 2015 | 1.91099e+09 | 1.91 |
| The Lion King | 2019 | 1.76269e+09 | 1.76 |
| Fast & Furious 7 | 2015 | 1.73242e+09 | 1.73 |
| Avengers: Age of Ultron | 2015 | 1.60376e+09 | 1.6 |
| Frozen II | 2019 | 1.53688e+09 | 1.54 |
| Avatar: The Way of Water | 2022 | 1.51656e+09 | 1.52 |
Visualize the gross adjusted values for each of the top 10 grossing (adjusted) movies.
top10.barh('Title', 'Gross Adjusted, billions')
# 1st argument is categorical variable, 2nd argument is column label of frequencies (horizontal axis),
# if left blank, it will try to plot all other columns with a legend
plots.title("The Top 10 Grossing Movies")
plots.show()
Visual Perception Accuracy¶
From Nathan Yau’s Data Points: Visualization that Means Something, our eyes can extract information at different levels of accuracy depending on the design.
For this reason, pie charts are generally discouraged because most people have a difficult time visually interpreting angles compared to lengths of bars.
Demo: Visualizing Du Bois¶
Read the du_bois.csv data as a table, reformat the data, and create a stacked bar chart.
# These data are in the visual 'Income and Expenditure...' above
du_bois = Table.read_table('./data/du_bois.csv')
du_bois.set_format('RENT', PercentFormatter)
du_bois.set_format('FOOD', PercentFormatter)
du_bois.set_format('CLOTHES', PercentFormatter)
du_bois.set_format('TAXES', PercentFormatter)
du_bois.set_format('OTHER', PercentFormatter)
du_bois
| CLASS | ACTUAL AVERAGE | RENT | FOOD | CLOTHES | TAXES | OTHER | STATUS |
|---|---|---|---|---|---|---|---|
| 100-200 | 139.1 | 19.00% | 43.00% | 28.00% | 0.10% | 9.90% | POOR |
| 200-300 | 249.45 | 22.00% | 47.00% | 23.00% | 4.00% | 4.00% | POOR |
| 300-400 | 335.66 | 23.00% | 43.00% | 18.00% | 4.50% | 11.50% | FAIR |
| 400-500 | 433.82 | 18.00% | 37.00% | 15.00% | 5.50% | 24.50% | FAIR |
| 500-750 | 547 | 13.00% | 31.00% | 17.00% | 5.00% | 34.00% | COMFORTABLE |
| 750-1000 | 880 | 0.00% | 37.00% | 19.00% | 8.00% | 36.00% | COMFORTABLE |
| 1000 and over | 1125 | 0.00% | 29.00% | 16.00% | 4.50% | 50.50% | WELL-TO-DO |
Notice that the table is formatted to show percentages, but the values in the % columns are actually floats.
# to see this..
du_bois.column('RENT')
array([ 0.19, 0.22, 0.23, 0.18, 0.13, 0. , 0. ])
type(du_bois.column('RENT').item(0))
float
For a quick review, find the income bracket (CLASS) that spent the highest percentage of their income on rent.
du_bois.sort('RENT', True).column('CLASS').item(0)
'300-400'
Start to re-create the bar chart that Du Bois presented in Paris.
# since barh will plot all columns that it can make sense of, we drop the columns not in this visual
du_bois_for_bar = du_bois.drop('ACTUAL AVERAGE', 'STATUS')
du_bois_for_bar
| CLASS | RENT | FOOD | CLOTHES | TAXES | OTHER |
|---|---|---|---|---|---|
| 100-200 | 19.00% | 43.00% | 28.00% | 0.10% | 9.90% |
| 200-300 | 22.00% | 47.00% | 23.00% | 4.00% | 4.00% |
| 300-400 | 23.00% | 43.00% | 18.00% | 4.50% | 11.50% |
| 400-500 | 18.00% | 37.00% | 15.00% | 5.50% | 24.50% |
| 500-750 | 13.00% | 31.00% | 17.00% | 5.00% | 34.00% |
| 750-1000 | 0.00% | 37.00% | 19.00% | 8.00% | 36.00% |
| 1000 and over | 0.00% | 29.00% | 16.00% | 4.50% | 50.50% |
du_bois_for_bar.barh('CLASS')
# Some extra graph formatting you are not responsible for
plots.title('W.E.B. Du Bois Income and Expenditure')
plots.show()
[Optional] Interactive Charts with Plotly¶
- By default, we will be using the static visualizations that are made using the Matplotlib library.
- You have the ability to access interactive Plotly visualizations by adding an i in front of the table method name that creates the default visual.
- The arguments change to fit the Plotly functions.
[Optional] Demo: Visualizing Du Bois with Plotly¶
Create the interactive version of the bar chart.
du_bois_for_bar.ibarh(
column_for_categories='CLASS',
    title='W.E.B. Du Bois Income and Expenditure',
xaxis=dict(tickformat='0.1%')
)
Plotly has an easy way to stack the bars to create an overlaid bar chart.
# barmode and xaxis are available with ibarh because they are Plotly arguments
fig = du_bois_for_bar.ibarh(
column_for_categories='CLASS',
barmode="stack",
    title='W.E.B. Du Bois Income and Expenditure',
xaxis=dict(tickformat='0.1%')
)
We are starting to get something that looks like Du Bois's visual, but let's stop there because this is optional for this class. If you enjoy creating visualizations, read through the Plotly documentation or Matplotlib documentation to update the colors, add overlaid text, etc.
Numerical Data¶
Visualizing the Distribution of One Numerical Variable¶
Histograms tbl.hist are a standard way to visualize the distribution of one numerical variable.
Histograms will be covered in depth in the next lecture.
A Histogram¶
# histogram shows how one numerical variable is distributed
actors.hist('Total Gross', unit="Thousands of Dollars")
# Some extra graph formatting you are not responsible for
plots.title('Distribution of Total Gross')
plots.show()
Plotting Two Numerical Variables¶
Line graphs tbl.plot and Scatter plots tbl.scatter are standard ways to visualize the relationship of two numerical variables.
A Line Graph¶
# line graphs show how a numerical variable changes over time (most often)
top_movies = Table.read_table('./data/top_movies_2023.csv')
movies_per_year = top_movies.group('Year').relabeled('count', 'Number of Movies')
movies_per_year.where('Year', are.above(1999)).plot('Year', 'Number of Movies')
plots.xticks(np.arange(2000, 2023, 5))
plots.title('Number of Movies vs. Release Year')
plots.show()
A Scatter Plot¶
# scatter plots show how one numerical variable relates to another
actors.scatter('Number of Movies', 'Average per Movie')
plots.title('Average Pay per Movie (Thousands of Dollars) vs. Number of Movies')
plots.show()
When to use a line vs scatter plot?¶
- Use line plots for sequential data if:
- ... your x-axis has an order
- ... sequential differences in y values are meaningful
- ... there's only one y-value for each x-value
- Usually: x-axis is time or distance
- Use scatter plots for non-sequential data --- When you’re looking for associations
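One of the rules above, "only one y-value for each x-value," can be checked directly before choosing a line plot. A small sketch (illustrative arrays, not the course tables; the helper name is made up for this example):

```python
import numpy as np

def one_y_per_x(x):
    """True when every x-value is unique, so a line plot is unambiguous."""
    return len(np.unique(x)) == len(x)

# Ordered, unique years: a line plot makes sense
years = np.array([2015, 2016, 2017, 2018])

# Movie counts for several actors: 41 repeats, so points share an x-value
movie_counts = np.array([41, 69, 61, 44, 41])

print(one_y_per_x(years))         # True
print(one_y_per_x(movie_counts))  # False
```

When the check fails, or when the x-axis has no natural order, a scatter plot is the safer choice.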
Demo: Census¶
Explore the US Census data from the Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.
(Release date: June 2021, Updated January 2022 to include April 1, 2020 estimates)
url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/asrh/nc-est2020-agesex-res.csv'
full = Table.read_table(url)
full
| SEX | AGE | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | POPESTIMATE2016 | POPESTIMATE2017 | POPESTIMATE2018 | POPESTIMATE2019 | POPESTIMATE2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3944153 | 3944160 | 3951495 | 3963264 | 3926731 | 3931411 | 3954973 | 3984144 | 3963268 | 3882437 | 3826908 | 3762227 | 3735010 |
| 0 | 1 | 3978070 | 3978090 | 3957904 | 3966768 | 3978210 | 3943348 | 3949559 | 3973828 | 4003586 | 3981864 | 3897917 | 3842257 | 3773884 |
| 0 | 2 | 4096929 | 4096939 | 4090799 | 3971498 | 3980139 | 3993047 | 3960015 | 3967672 | 3992657 | 4021261 | 3996742 | 3911822 | 3853025 |
| 0 | 3 | 4119040 | 4119051 | 4111869 | 4102429 | 3983007 | 3992839 | 4007852 | 3976277 | 3984985 | 4009060 | 4035053 | 4009037 | 3921526 |
| 0 | 4 | 4063170 | 4063186 | 4077511 | 4122252 | 4112849 | 3994539 | 4006407 | 4022785 | 3992241 | 4000394 | 4021907 | 4045996 | 4017847 |
| 0 | 5 | 4056858 | 4056872 | 4064653 | 4087770 | 4132349 | 4123745 | 4007123 | 4020489 | 4038022 | 4007233 | 4012789 | 4032231 | 4054336 |
| 0 | 6 | 4066381 | 4066412 | 4073031 | 4075153 | 4097860 | 4142923 | 4135738 | 4020428 | 4034969 | 4052428 | 4019106 | 4022432 | 4040169 |
| 0 | 7 | 4030579 | 4030594 | 4043100 | 4083399 | 4085255 | 4108453 | 4154947 | 4148711 | 4034355 | 4048430 | 4063647 | 4027876 | 4029753 |
| 0 | 8 | 4046486 | 4046497 | 4025624 | 4053313 | 4093553 | 4096033 | 4120476 | 4167765 | 4162142 | 4047130 | 4059209 | 4071894 | 4034785 |
| 0 | 9 | 4148353 | 4148369 | 4125413 | 4035854 | 4063662 | 4104437 | 4107986 | 4133426 | 4181069 | 4175085 | 4058207 | 4067320 | 4078668 |
... (296 rows omitted)
In the previous lecture, we did the following:
- Select the
SEX,AGE,CENSUS2010POP, andPOPESTIMATE2019columns. - Relabel the 2010 and 2019 columns.
- Remove the 999 ages and focus just on the combined data where the
SEXvalue is 0. Drop theSEXcolumn since there is only one value there.
partial = full.select('SEX', 'AGE', 'CENSUS2010POP', 'POPESTIMATE2019')
simple = partial.relabeled(2, '2010').relabeled(3, '2019')
no_999 = simple.where('AGE', are.below(999))
everyone = no_999.where('SEX', 0).drop('SEX')
everyone
| AGE | 2010 | 2019 |
|---|---|---|
| 0 | 3944153 | 3762227 |
| 1 | 3978070 | 3842257 |
| 2 | 4096929 | 3911822 |
| 3 | 4119040 | 4009037 |
| 4 | 4063170 | 4045996 |
| 5 | 4056858 | 4032231 |
| 6 | 4066381 | 4022432 |
| 7 | 4030579 | 4027876 |
| 8 | 4046486 | 4071894 |
| 9 | 4148353 | 4067320 |
... (91 rows omitted)
Visualize the relationship between age and population size in 2010.
# can use .plot (a line graph) because there is only one y per age and age is sequential
everyone.plot('AGE', '2010')
plots.title('US Population Size')
plots.show()
Include lines for both 2010 and the estimated 2019 population sizes.
everyone.plot('AGE')
# if you leave off second argument, it will plot a line for each remaining column
# as long as it makes sense
plots.title('US Population Size')
plots.show()
Demo: Male and Female 2019 Estimates¶
Create a table with Age, Males, Females columns showing the population estimates in 2019 for males and females by age.
males = no_999.where('SEX', 1).drop('SEX')
females = no_999.where('SEX', 2).drop('SEX')
pop_2019 = Table().with_columns(
'Age', males.column('AGE'),
'Males', males.column('2019'),
'Females', females.column('2019')
)
pop_2019
| Age | Males | Females |
|---|---|---|
| 0 | 1921001 | 1841226 |
| 1 | 1963261 | 1878996 |
| 2 | 2000102 | 1911720 |
| 3 | 2048651 | 1960386 |
| 4 | 2068251 | 1977745 |
| 5 | 2063176 | 1969055 |
| 6 | 2055583 | 1966849 |
| 7 | 2058425 | 1969451 |
| 8 | 2082403 | 1989491 |
| 9 | 2075719 | 1991601 |
... (91 rows omitted)
Visualize the distribution of population size for both males and females.
pop_2019.plot('Age')
plots.title('2019 Population Size Estimates')
plots.show()
Calculate the percent female for each age.
# need (number of females / total number of people) * 100
total = pop_2019.column('Females') + pop_2019.column('Males')
pct_female = (pop_2019.column('Females') / total) * 100
pct_female
array([ 48.93979018, 48.90344399, 48.87032181, 48.89917454,
48.88153622, 48.83289177, 48.89701056, 48.89552211,
48.85910586, 48.96592842, 48.98425388, 48.96313718,
48.91848904, 48.91588355, 48.95682562, 48.99213593,
49.00723665, 48.9917086 , 48.94499775, 48.85555766,
48.8800806 , 48.89699809, 48.95129043, 48.84655675,
48.77220901, 48.76311842, 48.68996749, 48.84567382,
49.115004 , 49.23311185, 49.27161137, 49.33570713,
49.34690992, 49.39653681, 49.57328862, 49.7823678 ,
49.88801204, 49.99258886, 50.08019625, 49.89892133,
50.1409379 , 50.20977831, 50.37327215, 50.36508359,
50.27570341, 50.48253869, 50.64261911, 50.57544456,
50.61870656, 50.44489454, 50.56911629, 50.63449931,
50.80649435, 50.81894266, 50.89138769, 51.13627062,
51.2696241 , 51.37238838, 51.53410868, 51.46437873,
51.72648051, 51.88456258, 52.09723728, 52.31329221,
52.44314993, 52.76149769, 52.92230043, 53.03484444,
53.26468499, 53.27081102, 53.40722561, 53.44223716,
53.51022877, 53.9509406 , 54.25448772, 54.58073446,
54.83251151, 55.26819948, 55.82854715, 56.17047137,
56.3748233 , 57.03744511, 57.64539476, 58.2875019 ,
59.12037315, 59.77448788, 60.61994754, 61.50555207,
62.43469375, 63.42875214, 64.36264302, 65.56129226,
66.59478489, 67.76493653, 69.03326813, 70.06426052,
70.77789932, 72.11473518, 72.70429851, 74.48479938, 76.57254933])
Round the values to 3 decimal places so that it's easier to read.
pct_female = np.round(pct_female, 3)
pct_female
array([ 48.94 , 48.903, 48.87 , 48.899, 48.882, 48.833, 48.897,
48.896, 48.859, 48.966, 48.984, 48.963, 48.918, 48.916,
48.957, 48.992, 49.007, 48.992, 48.945, 48.856, 48.88 ,
48.897, 48.951, 48.847, 48.772, 48.763, 48.69 , 48.846,
49.115, 49.233, 49.272, 49.336, 49.347, 49.397, 49.573,
49.782, 49.888, 49.993, 50.08 , 49.899, 50.141, 50.21 ,
50.373, 50.365, 50.276, 50.483, 50.643, 50.575, 50.619,
50.445, 50.569, 50.634, 50.806, 50.819, 50.891, 51.136,
51.27 , 51.372, 51.534, 51.464, 51.726, 51.885, 52.097,
52.313, 52.443, 52.761, 52.922, 53.035, 53.265, 53.271,
53.407, 53.442, 53.51 , 53.951, 54.254, 54.581, 54.833,
55.268, 55.829, 56.17 , 56.375, 57.037, 57.645, 58.288,
59.12 , 59.774, 60.62 , 61.506, 62.435, 63.429, 64.363,
65.561, 66.595, 67.765, 69.033, 70.064, 70.778, 72.115,
72.704, 74.485, 76.573])
Add the female percent to the table.
pop_2019 = pop_2019.with_column('Percent female', pct_female)
pop_2019
| Age | Males | Females | Percent female |
|---|---|---|---|
| 0 | 1921001 | 1841226 | 48.94 |
| 1 | 1963261 | 1878996 | 48.903 |
| 2 | 2000102 | 1911720 | 48.87 |
| 3 | 2048651 | 1960386 | 48.899 |
| 4 | 2068251 | 1977745 | 48.882 |
| 5 | 2063176 | 1969055 | 48.833 |
| 6 | 2055583 | 1966849 | 48.897 |
| 7 | 2058425 | 1969451 | 48.896 |
| 8 | 2082403 | 1989491 | 48.859 |
| 9 | 2075719 | 1991601 | 48.966 |
... (91 rows omitted)
Visualize the relationship between age and the percent of the population that is female.
pop_2019.plot('Age', 'Percent female')
plots.title('Female Population Percentage over Age')
plots.show()
Be careful not to be visually misled by the y-axis.
# if we include the whole range 0-100, it is a much less dramatic increase
pop_2019.plot('Age', 'Percent female')
plots.ylim(0, 100);
plots.title('Female Population Percentage over Age')
plots.show()
Demo: Scatter Plots¶
Visualize the relationship between the number of movies and the average pay per movie for each actor in the dataset.
actors.scatter('Number of Movies', 'Average per Movie')
plots.title('Average per Movie (Thousands of Dollars) vs. Number of Movies')
plots.show()
Identify the outlier in the dataset.
# one way
actors.where('Average per Movie', are.above(400))
# another way
# actors.sort('Average per Movie', True).row(0).item(0)
| Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross |
|---|---|---|---|---|---|
| Anthony Daniels | 3162.9 | 7 | 451.8 | Star Wars: The Force Awakens | 936.7 |
max(actors.column('Average per Movie'))
451.80000000000001
max_ave = max(actors.column('Average per Movie'))
actors.where('Average per Movie', max_ave).column('Actor').item(0)
'Anthony Daniels'
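A third option is np.argmax, which returns the position of the largest value; that position can then be used to look up the matching name. A sketch on plain arrays (values copied from the table above, trimmed to three actors for illustration):

```python
import numpy as np

# Two parallel arrays mirroring columns of the actors table
names = np.array(['Harrison Ford', 'Samuel L. Jackson', 'Anthony Daniels'])
avg_per_movie = np.array([118.8, 69.2, 451.8])

# Index of the maximum in one array identifies the row in the other
top_index = np.argmax(avg_per_movie)
print(names[top_index])   # Anthony Daniels
```

This avoids sorting the whole table when only the single largest row is needed.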
[Optional] Demo: Scatter Plots¶
Again, for all the visualization methods we use from the datascience library, if you put an i in front of the name of the visualization method, you can access an interactive version of the plot that is based on another visualization library called Plotly. You will not be tested on your knowledge of these interactive plots. You might find them helpful for exploring the data.
actors.iscatter(
column_for_x='Number of Movies',
select='Average per Movie',
labels='Actor',
title='Average per Movie (Thousands of Dollars) vs. Number of Movies'
)